{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train/Test Split\n",
    "\n",
    "A common problem is that powerful models can perfectly fit the data on which they are trained. These models are often <b>low bias</b> and <b>high variance</b> However, we can't observe the variance of a model directly, because we only know how it fits the data we have rather than all potential samples. The goal of supervised learning and the models which we will learn about is to build a model that generalizes. It should accurately predict the future rather than the past.\n",
    "\n",
    "<b>Solution:</b> Use a procedure that <b>estimates</b> how well a model is likely to perform on out-of-sample data and use that to choose between models.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Dataset import\n",
    "from sklearn.datasets import load_iris\n",
    "\n",
    "# Decision tree based imports\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn import tree\n",
    "\n",
    "# For train test split and cross validation \n",
    "from sklearn.model_selection import train_test_split, KFold, cross_val_score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the Data\n",
    "The Iris dataset is one of datasets scikit-learn comes with that do not require the downloading of any file from some external website. The code below loads the iris dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal length (cm)</th>\n",
       "      <th>sepal width (cm)</th>\n",
       "      <th>petal length (cm)</th>\n",
       "      <th>petal width (cm)</th>\n",
       "      <th>target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n",
       "0                5.1               3.5                1.4               0.2   \n",
       "1                4.9               3.0                1.4               0.2   \n",
       "2                4.7               3.2                1.3               0.2   \n",
       "3                4.6               3.1                1.5               0.2   \n",
       "4                5.0               3.6                1.4               0.2   \n",
       "\n",
       "   target  \n",
       "0       0  \n",
       "1       0  \n",
       "2       0  \n",
       "3       0  \n",
       "4       0  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = load_iris()\n",
    "df = pd.DataFrame(data.data, columns=data.feature_names)\n",
    "df['target'] = data.target\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create X and y variable to stores the feature matrix and target from the Iris dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a DataFrame for both parts of data; don't forget to assign column names.\n",
    "X = df[['sepal length (cm)',\n",
    "        'sepal width (cm)',\n",
    "        'petal length (cm)',\n",
    "        'petal width (cm)']].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(150, 4)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = df['target'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = y.reshape(-1,1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(150, 1)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Train and Test on the Entire Data Set (Do Not Do This)\n",
    "This is what we have been doing so far in this class for convenience, but it is a bad practice. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "1. Train the model on the **entire data set**.\n",
    "2. Test the model on the **same data set** and evaluate how well we did by comparing the **predicted** response values with the **true** response values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Build Model and Make Predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the model you want to use\n",
    "# We already did this at top of page, but repeating in case you wonder where this code comes from\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "# Make an instance of the model\n",
    "clf = DecisionTreeClassifier(max_depth = 5, \n",
    "                             random_state = 0)\n",
    "\n",
    "# Train the model on the data\n",
    "clf.fit(X, y)\n",
    "\n",
    "# class predictions \n",
    "predictions = clf.predict(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Measure Model Performance\n",
    "\n",
    "While there are other ways of measuring model performance (precision, recall, F1 Score, [ROC Curve](https://towardsdatascience.com/receiver-operating-characteristic-curves-demystified-in-python-bd531a4364d0), etc), we are going to keep this simple for now and use accuracy as our metric. \n",
    "To do this are going to see how the model performs on new data (test set)\n",
    "\n",
    "Accuracy is defined as:\n",
    "(fraction of correct predictions): correct predictions / total number of data points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy Score: 1.0\n"
     ]
    }
   ],
   "source": [
    "# calculate classification accuracy\n",
    "score = clf.score(X, y)\n",
    "print('Accuracy Score: {0}'.format(score))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Problems With Training and Testing on the Same Data\n",
    "\n",
    "- Goal is to estimate likely performance of a model on **out-of-sample data**.\n",
    "- Maximizing the training accuracy rewards **overly complex models** that won't necessarily generalize.\n",
    "- Unnecessarily complex models **overfit** the training data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Overfitting](images/overfitting.png)\n",
    "\n",
    "*Image Credit: [Overfitting](http://commons.wikimedia.org/wiki/File:Overfitting.svg#/media/File:Overfitting.svg) by Chabacano. Licensed under GFDL via Wikimedia Commons.\n",
    "\n",
    "*Idea Credit: [@justmarkham](https://twitter.com/justmarkham)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train/Test Split (What we will mostly do in this class)\n",
    "\n",
    "1. Split the data set into two pieces: a **training set** and a **testing set**.\n",
    "2. Train the model on the **training set**.\n",
    "3. Test the model on the **testing set** and evaluate how well we did.\n",
    "\n",
    "What does this accomplish?\n",
    "\n",
    "- Models can be trained and tested on **different data** (We treat testing data like out-of-sample data).\n",
    "- Response values are known for the testing set and thus **predictions can be evaluated**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Undering train_test_split in python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on function train_test_split in module sklearn.model_selection._split:\n",
      "\n",
      "train_test_split(*arrays, **options)\n",
      "    Split arrays or matrices into random train and test subsets\n",
      "    \n",
      "    Quick utility that wraps input validation and\n",
      "    ``next(ShuffleSplit().split(X, y))`` and application to input data\n",
      "    into a single call for splitting (and optionally subsampling) data in a\n",
      "    oneliner.\n",
      "    \n",
      "    Read more in the :ref:`User Guide <cross_validation>`.\n",
      "    \n",
      "    Parameters\n",
      "    ----------\n",
      "    *arrays : sequence of indexables with same length / shape[0]\n",
      "        Allowed inputs are lists, numpy arrays, scipy-sparse\n",
      "        matrices or pandas dataframes.\n",
      "    \n",
      "    test_size : float, int or None, optional (default=None)\n",
      "        If float, should be between 0.0 and 1.0 and represent the proportion\n",
      "        of the dataset to include in the test split. If int, represents the\n",
      "        absolute number of test samples. If None, the value is set to the\n",
      "        complement of the train size. If ``train_size`` is also None, it will\n",
      "        be set to 0.25.\n",
      "    \n",
      "    train_size : float, int, or None, (default=None)\n",
      "        If float, should be between 0.0 and 1.0 and represent the\n",
      "        proportion of the dataset to include in the train split. If\n",
      "        int, represents the absolute number of train samples. If None,\n",
      "        the value is automatically set to the complement of the test size.\n",
      "    \n",
      "    random_state : int, RandomState instance or None, optional (default=None)\n",
      "        If int, random_state is the seed used by the random number generator;\n",
      "        If RandomState instance, random_state is the random number generator;\n",
      "        If None, the random number generator is the RandomState instance used\n",
      "        by `np.random`.\n",
      "    \n",
      "    shuffle : boolean, optional (default=True)\n",
      "        Whether or not to shuffle the data before splitting. If shuffle=False\n",
      "        then stratify must be None.\n",
      "    \n",
      "    stratify : array-like or None (default=None)\n",
      "        If not None, data is split in a stratified fashion, using this as\n",
      "        the class labels.\n",
      "    \n",
      "    Returns\n",
      "    -------\n",
      "    splitting : list, length=2 * len(arrays)\n",
      "        List containing train-test split of inputs.\n",
      "    \n",
      "        .. versionadded:: 0.16\n",
      "            If the input is sparse, the output will be a\n",
      "            ``scipy.sparse.csr_matrix``. Else, output type is the same as the\n",
      "            input type.\n",
      "    \n",
      "    Examples\n",
      "    --------\n",
      "    >>> import numpy as np\n",
      "    >>> from sklearn.model_selection import train_test_split\n",
      "    >>> X, y = np.arange(10).reshape((5, 2)), range(5)\n",
      "    >>> X\n",
      "    array([[0, 1],\n",
      "           [2, 3],\n",
      "           [4, 5],\n",
      "           [6, 7],\n",
      "           [8, 9]])\n",
      "    >>> list(y)\n",
      "    [0, 1, 2, 3, 4]\n",
      "    \n",
      "    >>> X_train, X_test, y_train, y_test = train_test_split(\n",
      "    ...     X, y, test_size=0.33, random_state=42)\n",
      "    ...\n",
      "    >>> X_train\n",
      "    array([[4, 5],\n",
      "           [0, 1],\n",
      "           [6, 7]])\n",
      "    >>> y_train\n",
      "    [2, 0, 3]\n",
      "    >>> X_test\n",
      "    array([[2, 3],\n",
      "           [8, 9]])\n",
      "    >>> y_test\n",
      "    [1, 4]\n",
      "    \n",
      "    >>> train_test_split(y, shuffle=False)\n",
      "    [[0, 1, 2], [3, 4]]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(train_test_split)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Understanding the `random_state` Parameter\n",
    "\n",
    "The `random_state` is a pseudo-random number that allows us to reproduce our results every time we run them. However, it makes it impossible to predict what are exact results will be if we chose a new `random_state`.\n",
    "\n",
    "`random_state` is very useful for testing that your model was made correctly since it provides you with the same split each time. However, make sure you remove it if you are testing for model variability!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below makes roughly 80% (this could change for future version of scikit-learn) of the data into a training set and the remaining into a testing set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![train_test_split](images/trainTestSplit.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X,\n",
    "                                                    y,\n",
    "                                                    random_state = 0,\n",
    "                                                    test_size =.20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(150, 4)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Original features matrix\n",
    "X.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(150, 1)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Original target vector\n",
    "y.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(120, 4)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(30, 4)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(120, 1)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(30, 1)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Build Model and Make Predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the model you want to use\n",
    "# We already did this at top of page, but repeating in case you wonder where this code comes from\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "# Make an instance of the model\n",
    "clf = DecisionTreeClassifier(max_depth = 3, \n",
    "                             random_state = 0)\n",
    "\n",
    "\n",
    "\n",
    "# Train the model on the training data\n",
    "clf.fit(X_train, y_train)\n",
    "\n",
    "# class predictions for the test set\n",
    "predictions = clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Measure Model Performance\n",
    "\n",
    "While there are other ways of measuring model performance (precision, recall, F1 Score, [ROC Curve](https://towardsdatascience.com/receiver-operating-characteristic-curves-demystified-in-python-bd531a4364d0), etc), we are going to keep this simple for now and use accuracy as our metric. \n",
    "To do this are going to see how the model performs on new data (test set)\n",
    "\n",
    "Accuracy is defined as:\n",
    "(fraction of correct predictions): correct predictions / total number of data points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy Score: 0.9666666666666667\n"
     ]
    }
   ],
   "source": [
    "# calculate classification accuracy\n",
    "score = clf.score(X_test, y_test)\n",
    "print('Accuracy Score: {0}'.format(score))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Advantages of Train/Test Split</b>: Fast, simple, computationally inexpensive.\n",
    "\n",
    "<b>Disadvantages of Train/Test Split</b>: Eliminates data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### K-Folds Cross-Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Train/test split is useful, but it's a shame that we aren't using more data for training. \n",
    "\n",
    "**How can we use the maximum amount of our data points while still ensuring model integrity?**\n",
    "\n",
    "1. Split the dataset into K equal partitions (or \"folds\")\n",
    "    * So if k = 5 and dataset has 150 observations\n",
    "    * Each of the 5 folds would have 30 observations\n",
    "2. Use fold 1 as the testing set and rest is a training set\n",
    "    * Testing set = 30 observations (fold 5)\n",
    "    * Training set = 120 observations (fold 1-4)\n",
    "3. Calculate testing accuracy\n",
    "4. Repeat step 2 and step 3 K times, using a different fold as the testing set each time.\n",
    "    * 2nd iteration\n",
    "        * fold 4 would be the testing set\n",
    "        * combination of fold 1, 2, 3, and 5 would be the training set\n",
    "    * 3rd iteration\n",
    "        * fold 3 would be the testing set\n",
    "        * combination of fold 1, 2, 4, and 5 would be the training set\n",
    "    * 4th iteration\n",
    "        * fold 2 would be the testing set\n",
    "        * combination of fold 1, 3, 4, and 5 would be the training set\n",
    "    * 5th iteration\n",
    "        * fold 1 would be the testing set\n",
    "        * combination of fold 2, 3, 4, and 5 would be the training set\n",
    "5. Average all test accuracies to get the estimated out-of-sample accuracy.\n",
    "\n",
    "Although this may sound complicated, we are just training the model on k separate train-test-splits, then taking an average of the resulting test accuracies. This is more computationally intensive than train test split."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/cross_validation_diagram.png)\n",
    "\n",
    "There are many different variations of this procedure that we aren't covering in this class. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create a cross-valiation with five folds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on class KFold in module sklearn.model_selection._split:\n",
      "\n",
      "class KFold(_BaseKFold)\n",
      " |  KFold(n_splits='warn', shuffle=False, random_state=None)\n",
      " |  \n",
      " |  K-Folds cross-validator\n",
      " |  \n",
      " |  Provides train/test indices to split data in train/test sets. Split\n",
      " |  dataset into k consecutive folds (without shuffling by default).\n",
      " |  \n",
      " |  Each fold is then used once as a validation while the k - 1 remaining\n",
      " |  folds form the training set.\n",
      " |  \n",
      " |  Read more in the :ref:`User Guide <cross_validation>`.\n",
      " |  \n",
      " |  Parameters\n",
      " |  ----------\n",
      " |  n_splits : int, default=3\n",
      " |      Number of folds. Must be at least 2.\n",
      " |  \n",
      " |      .. versionchanged:: 0.20\n",
      " |          ``n_splits`` default value will change from 3 to 5 in v0.22.\n",
      " |  \n",
      " |  shuffle : boolean, optional\n",
      " |      Whether to shuffle the data before splitting into batches.\n",
      " |  \n",
      " |  random_state : int, RandomState instance or None, optional, default=None\n",
      " |      If int, random_state is the seed used by the random number generator;\n",
      " |      If RandomState instance, random_state is the random number generator;\n",
      " |      If None, the random number generator is the RandomState instance used\n",
      " |      by `np.random`. Used when ``shuffle`` == True.\n",
      " |  \n",
      " |  Examples\n",
      " |  --------\n",
      " |  >>> import numpy as np\n",
      " |  >>> from sklearn.model_selection import KFold\n",
      " |  >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])\n",
      " |  >>> y = np.array([1, 2, 3, 4])\n",
      " |  >>> kf = KFold(n_splits=2)\n",
      " |  >>> kf.get_n_splits(X)\n",
      " |  2\n",
      " |  >>> print(kf)  # doctest: +NORMALIZE_WHITESPACE\n",
      " |  KFold(n_splits=2, random_state=None, shuffle=False)\n",
      " |  >>> for train_index, test_index in kf.split(X):\n",
      " |  ...    print(\"TRAIN:\", train_index, \"TEST:\", test_index)\n",
      " |  ...    X_train, X_test = X[train_index], X[test_index]\n",
      " |  ...    y_train, y_test = y[train_index], y[test_index]\n",
      " |  TRAIN: [2 3] TEST: [0 1]\n",
      " |  TRAIN: [0 1] TEST: [2 3]\n",
      " |  \n",
      " |  Notes\n",
      " |  -----\n",
      " |  The first ``n_samples % n_splits`` folds have size\n",
      " |  ``n_samples // n_splits + 1``, other folds have size\n",
      " |  ``n_samples // n_splits``, where ``n_samples`` is the number of samples.\n",
      " |  \n",
      " |  Randomized CV splitters may return different results for each call of\n",
      " |  split. You can make the results identical by setting ``random_state``\n",
      " |  to an integer.\n",
      " |  \n",
      " |  See also\n",
      " |  --------\n",
      " |  StratifiedKFold\n",
      " |      Takes group information into account to avoid building folds with\n",
      " |      imbalanced class distributions (for binary or multiclass\n",
      " |      classification tasks).\n",
      " |  \n",
      " |  GroupKFold: K-fold iterator variant with non-overlapping groups.\n",
      " |  \n",
      " |  RepeatedKFold: Repeats K-Fold n times.\n",
      " |  \n",
      " |  Method resolution order:\n",
      " |      KFold\n",
      " |      _BaseKFold\n",
      " |      BaseCrossValidator\n",
      " |      builtins.object\n",
      " |  \n",
      " |  Methods defined here:\n",
      " |  \n",
      " |  __init__(self, n_splits='warn', shuffle=False, random_state=None)\n",
      " |      Initialize self.  See help(type(self)) for accurate signature.\n",
      " |  \n",
      " |  ----------------------------------------------------------------------\n",
      " |  Data and other attributes defined here:\n",
      " |  \n",
      " |  __abstractmethods__ = frozenset()\n",
      " |  \n",
      " |  ----------------------------------------------------------------------\n",
      " |  Methods inherited from _BaseKFold:\n",
      " |  \n",
      " |  get_n_splits(self, X=None, y=None, groups=None)\n",
      " |      Returns the number of splitting iterations in the cross-validator\n",
      " |      \n",
      " |      Parameters\n",
      " |      ----------\n",
      " |      X : object\n",
      " |          Always ignored, exists for compatibility.\n",
      " |      \n",
      " |      y : object\n",
      " |          Always ignored, exists for compatibility.\n",
      " |      \n",
      " |      groups : object\n",
      " |          Always ignored, exists for compatibility.\n",
      " |      \n",
      " |      Returns\n",
      " |      -------\n",
      " |      n_splits : int\n",
      " |          Returns the number of splitting iterations in the cross-validator.\n",
      " |  \n",
      " |  split(self, X, y=None, groups=None)\n",
      " |      Generate indices to split data into training and test set.\n",
      " |      \n",
      " |      Parameters\n",
      " |      ----------\n",
      " |      X : array-like, shape (n_samples, n_features)\n",
      " |          Training data, where n_samples is the number of samples\n",
      " |          and n_features is the number of features.\n",
      " |      \n",
      " |      y : array-like, shape (n_samples,)\n",
      " |          The target variable for supervised learning problems.\n",
      " |      \n",
      " |      groups : array-like, with shape (n_samples,), optional\n",
      " |          Group labels for the samples used while splitting the dataset into\n",
      " |          train/test set.\n",
      " |      \n",
      " |      Yields\n",
      " |      ------\n",
      " |      train : ndarray\n",
      " |          The training set indices for that split.\n",
      " |      \n",
      " |      test : ndarray\n",
      " |          The testing set indices for that split.\n",
      " |  \n",
      " |  ----------------------------------------------------------------------\n",
      " |  Methods inherited from BaseCrossValidator:\n",
      " |  \n",
      " |  __repr__(self)\n",
      " |      Return repr(self).\n",
      " |  \n",
      " |  ----------------------------------------------------------------------\n",
      " |  Data descriptors inherited from BaseCrossValidator:\n",
      " |  \n",
      " |  __dict__\n",
      " |      dictionary for instance variables (if defined)\n",
      " |  \n",
      " |  __weakref__\n",
      " |      list of weak references to the object (if defined)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(KFold)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Making ths process similar to the image in the previous section\n",
    "kf = KFold(n_splits=5, shuffle=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "~~~~ CROSS VALIDATION each fold ~~~~\n",
      "Model:  1\n",
      "Accuracy:  1.0\n",
      "Model:  2\n",
      "Accuracy:  0.9666666666666667\n",
      "Model:  3\n",
      "Accuracy:  0.8666666666666667\n",
      "Model:  4\n",
      "Accuracy:  0.9333333333333333\n",
      "Model:  5\n",
      "Accuracy:  0.7333333333333333\n"
     ]
    }
   ],
   "source": [
    "accuracy_list = []\n",
    "n= 0\n",
    "print(\"~~~~ CROSS VALIDATION each fold ~~~~\")\n",
    "for train_index, test_index in kf.split(X, y):\n",
    "    clf = DecisionTreeClassifier().fit(X[train_index], y[train_index])\n",
    "    score = clf.score(X[test_index], y[test_index])\n",
    "\n",
    "    accuracy_list.append(score)\n",
    "    print('Model: ', n+1)\n",
    "    print('Accuracy: ', accuracy_list[n])\n",
    "    n = n + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean of Accuracy for all folds: 0.9\n"
     ]
    }
   ],
   "source": [
    "print('Mean of Accuracy for all folds:', np.mean(accuracy_list))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9666666666666668"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf = DecisionTreeClassifier()\n",
    "\n",
    "# cross-validatation using a method (very similar to what we did in the code above)\n",
    "cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accuracy is different each time because the sampling is different each time. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparing cross-validation to train/test split\n",
    "Advantages of **cross-validation:**\n",
    "\n",
    "- More accurate estimate of out-of-sample accuracy\n",
    "- More \"efficient\" use of data (every observation is used for both training and testing)\n",
    "\n",
    "Advantages of **train/test split:**\n",
    "\n",
    "- Runs K times faster than K-fold cross-validation\n",
    "- Simpler to examine the detailed results of the testing process"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
